Abstract

The aim of this paper is a theoretical study of cache systems in order to
optimize proxy cache systems and to modernize their construction principles,
including prefetching schemes. Two types of relations, the Zipf-like
distribution and the normalizing conditions, play the role of fundamental laws.
The corresponding system of equations describes the basic effects, such as the
ratio between construction parts, steady-state performance, optimal size and
long-term prefetching. A modification of the fundamental laws leads to a
description of new effects caused by the renewal of documents in the global
network. An Internet traffic caching system based on the Zipf-like distribution
(ZBS) is proposed. An additional module for the cache construction provides
effective prefetching by lifetime.

Key words: Zipf-like distribution, cache optimization, principles of cache 
construction, renewal of Web documents, long-term prefetching 



1 Introduction

The rapid development of computer technologies at the end of the last
millennium led to the appearance of a virtual world with its own laws.
Unfortunately, too little attention was paid in this period to studying the
fundamental principles of virtual life.

The main constructions of the virtual world are based on concrete algorithms
intended to provide vital functions. The simplest and most obvious principles
were used for constructing and modernizing these algorithms. During their
implementation there was no time for study and optimization. Their low
performance was often compensated for by the growing power of computers. The
main thing was that the new constructions of the virtual world made it possible
to cope with the problems that arose.

Nowadays we can afford some respite, and part of the effort should be shifted
to studying the fundamental laws and to optimizing vital systems on the basis
of recent knowledge. Frankly speaking, a research area may be considered a
scientific field if and only if its basic principles are expressed in
mathematical form and new rules can be predicted on the basis of known facts.

The aim of this paper is a theoretical study of cache systems in order to find
ways of optimizing proxy systems and to modernize their construction
principles. The principles that should be used for a model are discussed as
well. A universal model can easily be generalized to any application based on
a Zipf-like distribution, such as prefetching schemes, content distribution
networks (CDN) [7], peer-to-peer systems [12,13], Internet search engines, etc.

The rest of the paper is organized as follows. Section 2 provides background
information about the model parameters. Section 3 presents the analytical laws
and the special points of the theoretical model. Section 4 describes the main
elements of cache construction. In Section 5 we examine the steady-state, or
limiting, performance of static caching schemes, when Web documents do not
change and no new documents are generated. Section 6 discusses ways of
optimizing cache systems. New principles of cache construction are formulated
in Section 7. Section 8 discusses the mathematical description of the renewal
of Web documents. The benefit of Web prefetching is the subject of Section 9.



2 Model parameters 

In this section the variables are defined and the basic approaches and terms
are analyzed.

In today's common Web configurations, a proxy server sits between clients and
Web servers. Clients send all requests to the global network, as shown in
Fig. 1. Some documents are requested several times and therefore should be
held in a cache system.

As reported in Refs. [2,3,10], the relative frequency of requests to Web pages
follows a Zipf-like distribution [16]. This distribution states that the
relative probability of a request for the $i$-th most popular page is

$$p_i = \frac{A}{i^{\alpha}}, \qquad (1)$$

where $A = p_1$ is the probability of the most popular item and $\alpha$ is a
positive exponent less than unity.

Fig. 1. Scheme of the proxy cache
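As a minimal sketch of Eq. (1), the following Python function tabulates
Zipf-like request probabilities for a finite universe of documents. The name
`zipf_probabilities`, the universe size `n` and the choice of fixing $A$ so
that the probabilities sum to one are assumptions made for illustration only.

```python
def zipf_probabilities(n, alpha):
    """Return [p_1, ..., p_n] with p_i = A / i**alpha, normalized to sum to 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Unnormalized Zipf weights 1 / i**alpha for ranks i = 1..n.
    weights = [1.0 / i ** alpha for i in range(1, n + 1)]
    total = sum(weights)  # equals 1/A for a finite universe of n documents
    return [w / total for w in weights]


if __name__ == "__main__":
    probs = zipf_probabilities(1000, 0.7)
    # Probability of the most popular page and of the 10th most popular page.
    print(probs[0], probs[9])
```

With this normalization $A$ is simply the first element of the returned list.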

Let the users request $K$ documents during time $t$:

$$K(\nu_{out}, t) = C\,\nu_{out}\,t. \qquad (2)$$

This quantity depends on the aggregated bandwidth $\nu_{out}$, on the time $t$
and on a constant $C$. The constant $C$ from Eq. (2) is inversely proportional
to the mean size of the documents received from the global network.

In Ref. [15] the number of received documents $K$ is defined by the request
stream from a population of $N$ users,

$$K(\lambda N, t) = \lambda N t, \qquad (4)$$

where $\lambda$ is the average client request rate.

Let $p$ documents be unique and let $M$ be the number of documents that can be
requested from the cache system, as shown in Fig. 2. Then the efficiency of
the system, or hit ratio, can be defined as

$$H = \frac{k}{K} \int_1^M \frac{A}{x^{\alpha}}\,dx, \qquad (5)$$

the ratio of the non-unique cacheable documents
$k - p = k \int_1^M \frac{A}{x^{\alpha}}\,dx$ to their total number $K$.

Fig. 2. Special points of cache system



Wolman et al. [15] examined the steady-state, or limiting, performance of
caching schemes. Their model of steady-state performance assumes that a cache
can store all cacheable documents in the Web and that there are no capacity
misses in the workload. The probability that a requested document is cacheable
is $p_c$ (so that $k = p_c K$). Under these assumptions the maximal value of
$H$ coincides with $p_c$, i.e. the ideal hit rate for cacheable documents,
$H_i = H/p_c$, would approach 100%.

A key difference between the models presented by Breslau et al. [3] and by
Wolman et al. [15] is that the latter incorporates the document rate of change
rather than assuming that documents are static. To describe this effect, an
additional multiplier is included in the integrand of Eq. (5):

$$\frac{\lambda N p_i}{\lambda N p_i + \mu}, \qquad (6)$$

where $\mu$ is the exponential parameter of the interarrival times for object
requests and updates, and $p_i$ is proportional to $1/i^{\alpha}$. This
multiplier removes one effective query for each cached document during its
lifetime $T_{ch} = 1/\mu$, that is, the interval between two consecutive
modifications of the object. Such an updating request must be redirected to
the global network to fetch the renewed Web page.



3 The analytical laws and the special points 



An analytical model is constructed in order to predict the behaviour of the
investigated systems on the basis of analytical laws written in mathematical
form, and to find ways of optimizing them. The mathematical constructions of
computer models resemble the laws of thermodynamics and molecular physics in
many respects. There are two directions of modeling: the macroscopic and the
microscopic approach.

In the first case the corresponding laws are formulated for parameters that
describe the cache system as an indivisible object. The following quantities
should be regarded as macroscopic parameters: the hit ratio $H$, the Zipf
exponent $\alpha$, the number of unique documents $p$, $M$, etc.

The microscopic description assumes an analysis at the level of transmitted IP
packets and of requests to the global network, i.e. of a huge number of
events. In this case investigators have to apply the theory of stochastic
processes, such as the Poisson distribution or Markov chains. The main problem
of this approach is the correct interpretation of the analytical results and
their generalization, i.e. the limiting passage from microscopic values to
macroscopic laws.

Usually the macroscopic dependencies operate with values that are measured
directly. Unfortunately, the corresponding laws have not yet been found for
the main part of the virtual world, so particular attention should be given to
the search for them. The proxy cache is a pleasant exception to this general
rule.

Two types of relations lay claim to the role of fundamental laws describing
cache systems [5]:

• A Zipf-like distribution, see Eq. (1)

• Normalizing conditions, i.e. the sum of the probabilities of requesting the
universe of $1 \le n \le k$ objects.

These laws can be applied to the special points of the Zipf distribution. As
shown in Fig. 2, there are several special points, but only two of them, $M$
and $p$, are used to construct the theory. For these points the Zipf-like
distribution leads to

$$\frac{kA}{M^{\alpha}} = 2, \qquad (7)$$

$$\frac{kA}{p^{\alpha}} = 1. \qquad (8)$$

The normalizing conditions for the first $M$ and $p$ documents from a cache are

$$\int_1^M \frac{A}{x^{\alpha}}\,dx = H_i, \qquad (9)$$

$$\int_1^p \frac{A}{x^{\alpha}}\,dx = 1. \qquad (10)$$

Here $H_i$ is the ideal (steady-state) performance, which assumes unlimited
capacity. For a real system Eq. (9) is transformed to

$$H = p_c \sum_{i=1}^{S_k} \frac{A}{i^{\alpha}}, \qquad (11)$$

where $S_k$ is the number of cache objects requested repeatedly, i.e.
$\vartheta_i \ge 2$ (with $\vartheta_i = p_i k$) for $i \le S_k$.

The five Eqs. (7)-(11) make it possible to describe the basic effects, to
predict parameter values, to find ways of optimization, etc. Modifying this
system of five equations leads to the description of new effects.



4 Elements of cache construction 

This section discusses which elements are compulsory for a cache system. The
possible ratios between these elements are investigated as well.

Three main parts are distinguished in any cache system:

• A kernel $S_k$ that contains popular documents with $\vartheta_i \ge 2$.

• An accessory part $S_u$ that keeps unpopular documents requested from the
Internet once, i.e. $\vartheta_i = 1$.

• A managing part $S_m$ that contains the statistics of requests and the rules
for replacement of cache objects.

The system of Eqs. (7)-(11) determines the mathematical relations between the
cache elements $S_k$, $S_u$, $S_m$.



First of all, the expressions needed for an experimental search for $\alpha$
must be written down. With the help of Eqs. (7,8),

$$p = 2^{1/\alpha_1} M. \qquad (12)$$

The relation between $p$ and $k$ that follows from the system of Eqs. (8,10)
gives

$$p = (1 - \alpha_2)\,k. \qquad (13)$$

Another useful relation is the solution of the system of Eqs. (7,9) with
respect to the variable $M$:

$$M = \frac{(1 - \alpha_3)\,H}{2}\,K. \qquad (14)$$

Fig. 3. Possible $\alpha$ values



This means that only every 15th document among all those requested from the
global network should be stored in the cache (for typical values
$\alpha > 0.7$, $H > 35\%$).
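The relations (12)-(14) can be evaluated numerically. The sketch below assumes
a single exponent `alpha` in all three relations (the text distinguishes
$\alpha_1$, $\alpha_2$, $\alpha_3$), and the function names are hypothetical;
it illustrates the formulas and is not a measurement procedure.

```python
def unique_documents(k, alpha):
    """Eq. (13): number of unique documents p among k cacheable requests."""
    return (1.0 - alpha) * k


def repeated_documents_from_p(p, alpha):
    """Eq. (12) solved for M: documents requested at least twice."""
    return p / 2.0 ** (1.0 / alpha)


def repeated_documents_from_hit_ratio(K, hit_ratio, alpha):
    """Eq. (14): M expressed through the hit ratio H and all K requests."""
    return (1.0 - alpha) * hit_ratio * K / 2.0


if __name__ == "__main__":
    k, alpha = 600_000, 0.7          # illustrative values, not measured data
    p = unique_documents(k, alpha)
    M = repeated_documents_from_p(p, alpha)
    print(f"p = {p:.0f}, M = {M:.0f}")
```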

It should be noted that the three Eqs. (12)-(14) give different values of
$\alpha$:

$$\alpha_1 < \alpha_2 < \alpha_3. \qquad (15)$$

The differences $\alpha_3 - \alpha_2$ and $\alpha_2 - \alpha_1$ are of order
0.04-0.05 and can be explained graphically, see Fig. 3. As a rule, the value
$\alpha_3$ from Eq. (14) is found from experimental data and is used for
further calculations.

By analyzing the log files of a proxy cache, a mean lifetime $t_u$ can be
calculated for those documents whose popularity is $\vartheta_i = 1$. The same
statistics also determine the lifetime $T_{eff}$ of cache objects with a citing
index $\vartheta_i = 2$, i.e. those items that were stored in the proxy cache,
requested once from the proxy and subsequently deleted.

It is easy to see that the number of documents in the cache kernel $S_k$ is the
value $M(T_{eff})$ corresponding to the time $T_{eff}$, while $S_u$ depends on
$p$, $M$ and $t_u$:

$$S_k = M(T_{eff}), \qquad (16)$$

$$S_u = p(t_u) - M(t_u). \qquad (17)$$

The following ratio between the parts $S_k$ and $S_u$ is found with the help of
Eqs. (2), (12):

$$\frac{S_k}{S_u} = \frac{T_{eff}}{(2^{1/\alpha} - 1)\,t_u}. \qquad (18)$$

Eq. (18) gives a criterion for the optimal use of storage capacity. This
criterion makes it possible to plan an experiment: how does the efficiency of
a cache replacement algorithm depend on the ratio between the proxy parts? It
should be noted that this result, as well as the experiment plan, is a
consequence of the theoretical study.

The analysis of log files from the Samara State Aerospace University proxy [6]
shows that the variables $t_u$ and $T_{eff}$ are directly proportional to the
cache size expressed as $S_{eff}/\nu_{int}$ and can be considered as coinciding
values:

$$T_{eff} \approx t_u. \qquad (19)$$

In other words, the kernel and accessory parts are related approximately as
1:2, i.e. less than 40% of the storage capacity is used for the basic goal of
storing the repeatedly requested documents.
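The ratio of Eq. (18) and the resulting share of the kernel in the total
storage can be sketched as follows. The function names are hypothetical, and
the default arguments encode the assumption $T_{eff} = t_u$ of Eq. (19).

```python
def kernel_to_accessory_ratio(alpha, t_eff, t_u):
    """Eq. (18): S_k / S_u for the lifetimes T_eff and t_u."""
    return t_eff / ((2.0 ** (1.0 / alpha) - 1.0) * t_u)


def kernel_share(alpha, t_eff=1.0, t_u=1.0):
    """Fraction S_k / (S_k + S_u) of the storage occupied by the kernel."""
    ratio = kernel_to_accessory_ratio(alpha, t_eff, t_u)
    return ratio / (1.0 + ratio)


if __name__ == "__main__":
    alpha = 0.7  # illustrative Zipf exponent
    print(f"S_k/S_u = {kernel_to_accessory_ratio(alpha, 1.0, 1.0):.3f}")
    print(f"kernel share = {kernel_share(alpha):.3f}")
```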



5 Steady-state performance (static Web) 



In this section we examine the steady-state, or limiting, performance of
static caching schemes, when Web documents do not change and no new documents
are generated. The model of steady-state performance by Wolman et al. [15]
assumes that a cache can store all cacheable documents in the Web and that
there are no capacity misses in the workload. Under this assumption they
conclude that in the long term the hit rate $H_i$ for cacheable documents would
approach 100%. However, a considerable part of the cacheable documents,
$p - M$, is requested from the global network only once, as shown in Fig. 2,
which illustrates the Zipf-like distribution.

A more accurate expression for the ideal hit ratio $H_i$ can be obtained by
eliminating $k$, $p$ and $M$ in turn with the help of Eqs. (12,13,14):

$$H_i \le 2^{(\alpha - 1)/\alpha}, \qquad (20)$$

which gives about 75% for $\alpha = 0.7$.

The alternative expression for the ideal hit ratio $H_i$ follows from Fig. 2:

$$H_i \le 1 - \frac{p - M}{k}, \qquad (21)$$

which gives 80% [6].
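Both estimates of the ideal hit ratio can be evaluated in a few lines. In this
sketch the second function substitutes Eqs. (12) and (13) with a single
exponent into the bound $1 - (p - M)/k$; that substitution and the function
names are assumptions of the example, whereas the 80% quoted above is taken
from [6].

```python
def ideal_hit_ratio_power(alpha):
    """Eq. (20): estimate 2**((alpha - 1) / alpha) of the ideal hit ratio."""
    return 2.0 ** ((alpha - 1.0) / alpha)


def ideal_hit_ratio_from_special_points(alpha):
    """Eq. (21): 1 - (p - M) / k, with p and M taken from Eqs. (12), (13)."""
    p_over_k = 1.0 - alpha                      # Eq. (13)
    m_over_k = p_over_k / 2.0 ** (1.0 / alpha)  # Eq. (12)
    return 1.0 - (p_over_k - m_over_k)


if __name__ == "__main__":
    alpha = 0.7
    print(f"Eq. (20): {ideal_hit_ratio_power(alpha):.3f}")
    print(f"Eq. (21): {ideal_hit_ratio_from_special_points(alpha):.3f}")
```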
The hit rate $H$ of a Web cache is considered to grow in a log-like fashion as
a function of cache size [1,3,4]. The work of Wolman et al. [15] shows that
there is no benefit in additional cache size beyond a certain size. However,
such statements hold only for extremely large sizes.

Cache administrators believe that a production cache should store about 2-3
days of traffic. Such a recommendation can be found on the site of The
Measurement Factory Inc., which has tested well-known cache products [11].

From the analysis of Eq. (11) for cache performance the following dependence
appears:

$$H \sim S_k^{1-\alpha}, \qquad (22)$$

which points to power-law growth [6] for cache sizes of a few days of traffic.
This fact has been confirmed by experimental research [6].



6 Optimization of cache systems 

Fine tuning of a cache that increases the hit ratio by only several percentage
points would be equivalent to a several-fold increase in cache size. To
determine an optimal cache size, Kelly and Reeves [9] rely on economic
considerations such as the monetary cost of memory and bandwidth.

The aim of our theoretical investigation is to find the dependencies between
the cache size $S = S_k + S_u$ and other macroscopic parameters such as the
internal bandwidth $\nu_{int}$, the maximum hit ratio $H_i$, the Zipf exponent
$\alpha$, etc.

As shown in Ref. [5], the ratio of the effective size of the cache system
$S_{eff}$ to the aggregated bandwidth of the external links $\nu_{int}$ can be
considered a constant:

$$\frac{S_{eff}}{\nu_{int}} = \tau. \qquad (23)$$

Our goal is to calculate a lower limit $\tau(\mu, \alpha)$ at which the
performance reaches the 35% level ($H_{eff} > 35\%$) typical of a well-working
system.

Eqs. (20,21) give the following estimate for a real system with
$\alpha > 0.7$, $p_c = 0.6$:

$$\frac{p_c H_i}{\sqrt{2}} > H_{eff} > 35\%. \qquad (24)$$



Eqs. (9,11) make it possible to calculate the relation between the number of
documents in a real cache kernel and the number $M_{max}$ corresponding to the
limiting performance $H_i$:

$$S_k = 2^{1/(2(\alpha - 1))} M_{max}, \qquad (25)$$



where 



(26) 



The parameter $T_{ch} = 1/\mu_u = 186$ days is the time between renewals of
unpopular documents [15]. If $H_{eff} = 0.35$, then

r^ u ,a) = 2^2^-^±-^. (27) 



It is easy to calculate $\tau \approx 6$ days for $\alpha = 0.8$.



7 New principles of cache construction 



The main result of any study of cache systems should be a way of increasing
the hit ratio. We see two basic directions for achieving this goal. The first
is the conjecture that the hit rate $H$ increases as the ratio $S_k/S_u$ grows.
The second is based on a general feature of existing cache algorithms: only
documents requested two or more times during the time $t_u$ are included in
the kernel $S_k$. We therefore conclude that a caching algorithm that rigidly
ties the main cache parameters to the time $t_u$ has low effectiveness. Our
approach is to decouple the management block, which is based on the request
statistics, from the time $t_u$.

The current situation can be resolved with new construction principles for
cache systems. The most important feature is that a fixed ratio between the
cache parts must be guaranteed. The statistics of requests must be kept for a
long time for all cacheable documents, including deleted ones. An inseparable
part of the new construction is a replacement algorithm with a metric function
based on the Zipf law.

The basic principles of an Internet traffic caching system based on a
Zipf-like distribution (ZBS) can be formulated as follows:

(1) Three main construction parts are distinguished in a ZBS system:

• A kernel $S_k$ that contains popular documents with $\vartheta_i \ge 2$.

• An accessory part $S_u$ that keeps unpopular documents requested from the
Internet once, i.e. with $\vartheta_i = 1$. The size of the accessory part
$S_u$ must not exceed 10% of the full size of the ZBS cache $S_{eff}$.

• A managing part $S_m$ that contains the statistics of requests as a basis
for the replacement policy. The control information, including the
statistics, is kept in $S_m$ for a long time interval
$t_s > 10\,(S_{eff}/\nu_{int})$, where $\nu_{int}$ is the aggregated bandwidth
of the external links. The value $t_s$ must exceed one month; the maximal
time interval is limited to half a year. The document popularity
$\vartheta_i$ takes into account all requests made by users during the time
$t_s$.

(2) An additional need to request a cached document from the global network
is caused by document modification. The newly received copy of the document
is stored in the kernel $S_k$ and the corresponding frequency parameter is set
to unity, $\vartheta_i = 1$.

(3) A replacement metric $C_i$ is calculated for each document stored in the
kernel $S_k$ from $T_i$, the time elapsed since the last modification of the
document, and $\vartheta_i$, the corresponding request frequency. Fig. 4
contains a graphical explanation of these variables. The document with the
largest $C_i$ is deleted when the kernel $S_k$ overflows.

(4) A metric $C_i$ that also involves $E_i$, the size of the corresponding
document, is applied to increase the byte hit ratio.

(5) A document is deleted from the accessory part $S_u$

• when it is moved to the kernel $S_k$ after the second request,

• according to recency-based policies, e.g. FIFO, when the accessory part
$S_u$ overflows.

Fig. 4. The illustration of variables included in metric function

The described cache scheme improves the hit ratio by a few percent according
to Eq. (22), because the kernel size $S_k$ roughly doubles, from 40% to 90% of
the cache, other things being equal. For a huge cache the gain consists in
saving half of the disk storage.
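The construction principles above can be sketched as a toy Python class. This
is an illustration under explicit assumptions, not the authors'
implementation: the class name `ZBSCache`, the capacities counted in
documents, the use of the insertion time as a stand-in for the last
modification time, and in particular the metric $C_i = T_i/\vartheta_i$ are
all hypothetical choices. Document modification (principle 2) and the
size-aware metric (principle 4) are left out.

```python
from collections import OrderedDict


class ZBSCache:
    """Toy ZBS-style cache: FIFO accessory part plus a metric-driven kernel."""

    def __init__(self, kernel_capacity, accessory_capacity):
        self.kernel_capacity = kernel_capacity
        self.accessory_capacity = accessory_capacity
        self.kernel = {}                # url -> time the copy was stored
        self.accessory = OrderedDict()  # url -> None, kept in FIFO order
        self.popularity = {}            # S_m: request counts, never purged here

    def metric(self, url, now):
        # ASSUMPTION: C_i = T_i / popularity_i. The text only states that the
        # metric is built from T_i and the request frequency.
        age = now - self.kernel[url]
        return age / self.popularity[url]

    def request(self, url, now):
        """Register a request at time `now`; return True on a cache hit."""
        self.popularity[url] = self.popularity.get(url, 0) + 1
        if url in self.kernel:
            return True
        if url in self.accessory:
            # Second request: move the document from S_u to the kernel S_k.
            del self.accessory[url]
            self._insert_into_kernel(url, now)
            return True
        if self.popularity[url] >= 2:
            # Known popular document that is no longer cached: fetch it
            # straight into the kernel, using the long-term statistics.
            self._insert_into_kernel(url, now)
        else:
            self.accessory[url] = None
            if len(self.accessory) > self.accessory_capacity:
                self.accessory.popitem(last=False)  # FIFO eviction
        return False

    def _insert_into_kernel(self, url, now):
        if len(self.kernel) >= self.kernel_capacity:
            # Evict the kernel document with the largest metric value.
            victim = max(self.kernel, key=lambda u: self.metric(u, now))
            del self.kernel[victim]
        self.kernel[url] = now
```

Because the request counts outlive the cached copies, a document evicted from
the accessory part is admitted directly to the kernel on its next request,
which is one way to decouple the statistics from the time $t_u$.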



8 Renewal of Web documents 



In the previous sections our analysis was based on the model presented by
Breslau et al. [3] and extended by Wolman et al. [15] to incorporate the
document rate of change. The Wolman model yields formulas that predict the
steady-state properties of Web caching systems parameterized by the population
size $N$, the population request rate $\lambda$, the document rate of change
$\mu$, the size of the object universe $n$ and a popularity distribution for
objects.

The key formulas from Wolman et al.,

$$C_N = \int_1^n \frac{1}{C x^{\alpha}} \left( \frac{1}{1 + \frac{\mu C x^{\alpha}}{\lambda N}} \right) dx, \qquad (28)$$

$$C = \int_1^n \frac{1}{x^{\alpha}}\,dx, \qquad (29)$$

yield $C_N$, the aggregate object hit ratio considering only cacheable objects.
The document rate of change $\mu$ was taken to have two different values, one
for popular documents, $\mu_p$, and another for unpopular documents, $\mu_u$.
The additional multiplier in Eq. (28) removes one effective query for each
cached document during the time $T_{ch} = 1/\mu$ between its changes. Such a
request updates the existing object when it has been modified.

In the present paper we assume that the document rate of change depends on its
popularity $\vartheta_i$. The mathematical equivalent of this assertion is that
the steady-state process of a demand cache is again described by a Zipf
distribution, with a smaller exponent $\alpha_R$, as shown in Fig. 5.

The difference between the ideal performance of a cache system and its real
value is caused by the renewal of documents in the global network:

$$\Delta H = \left( \frac{2M}{1 - \alpha} - HK \right) / K. \qquad (30)$$

The ideal Zipf-like distribution corresponds to the upper line in Fig. 5, but
the real hit ratio $H$ is determined by

$$\alpha_R = 1 - \frac{2M}{HK}. \qquad (31)$$

Table 1
The renewal parameters

$\alpha$   $\alpha_R$   $\Delta H$   $H$      $\mu_p$      $\mu_u$
0.72       0.7          2.3%         32.04%   1/6.2 days   1/202 days



Now it is easy to find $\mu_i$:



(p/i) 



■ St 



St 



(32) 



It should be noted again that the plan of the experiment is a consequence of
the theoretical study. The corresponding results were calculated from the
proxy trace of Samara State Aerospace University [6] and are collected in
Table 1.

Finally, the system of Eqs. (7)-(11) must be modified by replacing the variable
$\alpha$ with $\alpha_R$.
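The renewal relations of this section, Eq. (30) for $\Delta H$ and Eq. (31) for
$\alpha_R$, can be checked against the values of Table 1. In the sketch below
the request count `K` and the kernel size `M` are hypothetical numbers chosen
only so that $H$ and $\alpha_R$ match the table; the function names are
likewise assumptions.

```python
def renewal_exponent(M, H, K):
    """Eq. (31): effective Zipf exponent alpha_R of a demand cache."""
    return 1.0 - 2.0 * M / (H * K)


def hit_ratio_loss(M, H, K, alpha):
    """Eq. (30): difference between the ideal and the real hit ratio."""
    return (2.0 * M / (1.0 - alpha) - H * K) / K


if __name__ == "__main__":
    K = 1_000_000   # hypothetical number of requests
    H = 0.3204      # real hit ratio from Table 1
    M = 48_060      # hypothetical value giving alpha_R = 0.7 for this K and H
    print(f"alpha_R = {renewal_exponent(M, H, K):.3f}")
    print(f"delta H = {hit_ratio_loss(M, H, K, alpha=0.72):.4f}")
```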



9 Long-term Prefetching 



One way to further increase the cache hit ratio is to anticipate future
requests and prefetch these objects into a local cache. The benefit of Web
prefetching is a low retrieval latency for users, which can be expressed as a
high hit ratio.

This section examines the costs and potential benefits of long-term prefetch- 
ing for content distribution. In traditional short-term prefetching, caches use 
recent access history to predict and prefetch objects likely to be referenced in
the near future. In contrast, long-term prefetching uses long-term steady-state 
object access rates and update frequencies to identify objects to replicate to 
content distribution locations. 

Using analytic models and trace-based simulations, Venkataramani et al. [14]
and Jiang et al. [8] examine algorithms for selecting objects for long-term
prefetching. They use threshold-based algorithms and fetch those objects whose
values exceed the threshold. The object selection criterion of prefetching
determines which objects to replicate in advance. Several criteria are
available, such as object popularity and lifetime.

The Good Fetch prefetching algorithm was proposed by Venkataramani et al. [14].
For an object $i$, let $l_i$ be its lifetime, $p_i$ its probability of being
accessed, and $a$ the user request arrival rate, i.e. the number of requests
arriving per second. Then the probability that object $i$ is accessed before
it dies is

$$P_{goodFetch} = 1 - (1 - p_i)^{a\,l_i}, \qquad (33)$$

where $a \times l_i$ is the number of requests arriving during the lifetime of
object $i$.

Jiang et al. [8] proposed an algorithm that uses the value $a p_i l_i$ to
select objects. Objects with the highest value of $a p_i l_i$ are included in
the prefetching set, meaning that the objects with the most expected requests
are prefetched into caches.
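The two selection criteria can be sketched as follows. The function names and
the dictionary layout of `objects` are hypothetical; only the formula of
Eq. (33) and the product $a p_i l_i$ are taken from the text.

```python
def good_fetch_probability(p_i, a, lifetime):
    """Eq. (33): probability that object i is requested before it expires."""
    return 1.0 - (1.0 - p_i) ** (a * lifetime)


def apl_value(p_i, a, lifetime):
    """Selection value a * p_i * l_i used by Jiang et al. [8]."""
    return a * p_i * lifetime


def select_for_prefetch(objects, a, threshold):
    """Return names of objects whose Good Fetch probability exceeds threshold.

    `objects` maps a name to a (p_i, lifetime) pair; this layout is an
    assumption of the sketch.
    """
    return [name for name, (p_i, lifetime) in objects.items()
            if good_fetch_probability(p_i, a, lifetime) > threshold]


if __name__ == "__main__":
    catalogue = {"hot": (0.01, 1000.0), "cold": (1e-6, 10.0)}
    print(select_for_prefetch(catalogue, a=1.0, threshold=0.5))
```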

The analytical model derived in [14] makes it possible to express the hit ratio
of threshold-based algorithms. To describe this effect, an additional
multiplier is included in the integrand of Eq. (5). The fraction

$$ff(i) = \frac{a p_i l_i}{a p_i l_i + 1} \qquad (34)$$

is identical to the Wolman multiplier from Eq. (6). The fraction (34)
represents the hit ratio among accesses to object $i$. Stated otherwise, it
denotes the probability that object $i$ is fresh when it is accessed. It is
called the freshness factor of object $i$ and denoted $ff(i)$ in [14].

The approach developed in Section 8 makes it possible to calculate the
freshness factor $ff(i)$ as the ratio between the hit ratio of a demand cache
and that of the ideal performance, see Fig. 5. Taking into account Eq. (13)
and the data from Table 1, it is easy to see that

$$ff(i) = k_R/k_i = (1 - \alpha_i)/(1 - \alpha_R) = 0.93. \qquad (35)$$
For threshold-based algorithms, an object $i$ is always kept fresh by
prefetching it whenever it changes, provided its corresponding metric is above
the threshold value $T_p$. Then there will always be a hit on object $i$,
while the hit ratio for other objects remains equal to $ff(i)$. In other
words, the prefetching algorithms make it possible to reach the highest hit
ratio, corresponding to the upper line in Fig. 5, at the cost of additional
bandwidth. This additional bandwidth has been estimated [14],

$$\Delta BW = (1 - ff(i))\,\nu_{int}, \qquad (36)$$

as 7% of the mean aggregated bandwidth of the external links from Eq. (23).
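The freshness factor, its equivalence with the Wolman multiplier, and the
bandwidth estimate of Eq. (36) can be illustrated with a short sketch. The
function names are hypothetical, and the equivalence is shown under the
identification $a = \lambda N$ and $l_i = 1/\mu$, which is an assumption of
the example.

```python
def freshness_factor(a, p_i, lifetime):
    """Eq. (34): probability that object i is fresh when it is accessed."""
    x = a * p_i * lifetime
    return x / (x + 1.0)


def wolman_multiplier(lam, n_users, p_i, mu):
    """Eq. (6): the same quantity in the notation of Wolman et al."""
    return lam * n_users * p_i / (lam * n_users * p_i + mu)


def extra_bandwidth(ff, nu_int):
    """Eq. (36): additional bandwidth needed to keep objects fresh."""
    return (1.0 - ff) * nu_int


if __name__ == "__main__":
    # With a = lam * N and lifetime = 1 / mu both expressions coincide.
    print(freshness_factor(100.0, 0.01, 4.0))
    print(wolman_multiplier(2.0, 50, 0.01, 0.25))
    # Share of extra bandwidth for ff = 0.93 and unit external bandwidth.
    print(extra_bandwidth(0.93, 1.0))
```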

In our opinion, the additional module to the cache construction described in
Section 7 can provide effective prefetching by lifetime:

• An additional counter $N_i$ is introduced for each cache item. It equals the
number of modifications of the document during the time $T_{ins}$ since the
installation of the cache system.

• The threshold value is $T_p^i = T_{ins}/N_i$.

• If the time $T_i$ since the last modification of the document exceeds the
corresponding threshold $T_p^i$, the cache fetches this object.

The mathematical basis of this module is the uniform distribution of
modification moments within $T_{ins}$.
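The three rules of the lifetime module translate into a short sketch. The
function names are hypothetical, and the treatment of documents that have
never been modified (no prefetching) is an assumption that the text does not
specify.

```python
def lifetime_threshold(t_ins, n_modifications):
    """Threshold T_ins / N_i: mean interval between observed modifications."""
    if n_modifications <= 0:
        # ASSUMPTION: a document never seen to change is not prefetched.
        return float("inf")
    return t_ins / n_modifications


def should_prefetch(time_since_modification, t_ins, n_modifications):
    """True when the document has outlived its mean modification interval."""
    threshold = lifetime_threshold(t_ins, n_modifications)
    return time_since_modification > threshold


if __name__ == "__main__":
    # Cache installed 100 days ago; the document changed 10 times since then.
    print(should_prefetch(11.0, t_ins=100.0, n_modifications=10))
    print(should_prefetch(5.0, t_ins=100.0, n_modifications=10))
```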



10 Conclusions 

The aim of this paper was a theoretical study of cache systems in order to
optimize proxy cache systems and to modernize their construction principles,
including prefetching schemes. Two types of relations, the Zipf-like
distribution and the normalizing conditions, play the role of fundamental laws.
The corresponding system of equations describes the basic effects, such as the
ratio between construction parts, steady-state performance, optimal size and
long-term prefetching. A modification of the fundamental laws leads to a
description of new effects caused by the renewal of documents in the global
network.

The main result of any study of cache systems is a way of increasing the hit
ratio. Our theoretical study leads to a criterion for the optimal use of
storage capacity. This criterion makes it possible to plan several experiments
on the basis of the mathematical analysis.

The current situation can be resolved with new construction principles for
cache systems. An Internet traffic caching system based on the Zipf-like
distribution (ZBS) is proposed. The new system consists of three parts: a
kernel for storing popular documents, an accessory part for unpopular
documents and a management part for keeping the request statistics. A document
replacement algorithm based on a metric function that uses the Zipf-like
distribution is part of the new ZBS system. The additional module to the cache
construction described in Section 7 provides effective prefetching by lifetime.